This course gives a general overview of the possibilities of working with data. No matter the topic, all scientists need to work with lots of data, organizing and analyzing it in order to understand, explain and represent it. This course will give us the basic tools to do so, mainly working with the R programming language and the RStudio software, which are probably the most widely used open-source tools for statistical computing and graphics, thanks to their robustness.
The core values of the course are based on the importance of open data and knowledge sharing.
Link to GitHub repository: https://github.com/elirams/IODS-project
In this chapter we will go through how to read, interpret and represent data. Then we will learn how to apply and interpret simple and multiple regressions. Finally, we will go through the most important methods for validating regression models.
First we will read the data called students2014, explore its structure and dimensions:
learning2014<-read.csv("C:/Users/Elisabet/Desktop/ELI/MASTER'S DEGREE/Open Data Science/GitHub/IODS-project/IODS-project/learning2014.csv", header = TRUE)
dim(learning2014)
## [1] 166 7
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
We can see that our data has 7 variables (columns) and 166 observations. The data is about students' different approaches to learning. The variables measured are:
Gender: It has 2 levels: M for male, F for female
Age: age of the students. Integer data type variable.
Attitude: global attitude towards statistics.
Deep: stands for deep approach learning and is a combination of variables including seeking meaning, relating ideas and use of evidence. This variable measures to what extent the student intends to maximize understanding, with a true commitment to learning.
Stra: stands for strategic approach and is a combination of variables including organized studying and time management. This variable measures to what extent the students organize their studying.
Surf: stands for surface approach learning and is a combination of variables including lack of purpose, unrelated memorising and syllabus-boundness. This variable measures to what extent the student memorizes without understanding, with a serious lack of personal engagement in the learning process.
Points: the exam points that the students obtained. Integer data type. The maximum number of points a student could get in the exam is 30, plus some extra points from participating in the study, and the minimum for passing is 12.
The variables attitude, deep, stra and surf are numerical, with values ranging from 1 to 5: 1 meaning that the student is very far from one specific approach and 5 that the student is very close to that approach.
We will give a graphical overview of the data and show summaries of the variables, but first we load the required packages:
library(ggplot2)
library(GGally)
summary(learning2014)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
For the graphical overview of the data we will use the ggpairs plot from the GGally package (an extension of ggplot2), because it is a more advanced plot that shows at the same time the distribution of the data and the relationships between variables: it creates scatter plots and gives the correlation values. We specify the combo argument in the lower list because we have both discrete and continuous variables. Mapping describes the aesthetic parameters, and in this case we specify colouring by gender.
p <- ggpairs(learning2014, mapping = aes(col=gender), lower=list(combo=wrap("facethist", bins=20)))
p
In the summary of our data we can observe how the variables are distributed. For gender, we have 110 females and 56 males. The average age of respondents is 25.5, with a minimum of 17 and a maximum of 55 years. Most respondents are between 21 and 27 years old. The mean attitude towards statistics is 3.1; women have a higher mean attitude than men. The mean deep approach is 3.7 and there are no big differences between men and women. The strategic approach has a mean of 3.1, and women score slightly higher. Surface approach learning has a mean of 2.8 and there are no significant differences between men and women. The average exam score is 22.7; the lowest is 7 and the highest 33.
With the ggpairs plot we can see graphically the distribution of the variables on the diagonal, separated for men (blue) and women (pink). On the top row we can also see box plots showing the median and the first and third quartiles for each variable. On the lower side we have the scatter plots and on the upper side the correlation values. We can observe that the pair of variables with the clearly highest correlation is attitude and points, at 0.44. Exam points are also slightly related to the strategic and surface approaches, with correlations of 0.15 and 0.14, respectively.
Now we choose three variables as explanatory variables for the target (dependent) variable, which is the exam points. Based on the correlation values, we choose attitude, stra and surf as explanatory variables.
First we write the model. Since we have more than one explanatory variable, it is a multiple regression, with a formula of the form y ~ x1 + x2 + x3.
my_model <- lm(points ~ attitude + stra + surf, data=learning2014)
Let’s look at the summary of our model:
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
In the output we see a table where the first column shows the estimate of each coefficient (parameter) of the model, followed by the standard error of that estimate, and finally the t-value and the p-value used to accept or reject the null hypothesis. Based on the results, the only variable with a p-value close enough to zero to reject the null hypothesis, and therefore to affirm a statistically significant relationship with the target, is attitude. Stra and surf are thus not statistically related to points.
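The p-values discussed above can also be pulled directly out of the coefficient table (a small sketch, assuming the `my_model` object fitted earlier):

```r
# Extract the coefficient table from the model summary and keep its p-value column
coef(summary(my_model))[, "Pr(>|t|)"]
```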
We will then rewrite the model including only attitude as the explanatory variable:
my_model2 <- lm(points ~ attitude, data = learning2014)
summary(my_model2)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
So our final model now looks like this:
\[y = A + Bx\]
where y = points and x = attitude, giving
\[y = 11.6 + 3.5x\]
A is the point where the regression line intercepts the y-axis and B is the slope of the regression, which shows the relationship between x and y. So in this case, when attitude increases by one unit, the exam points increase by 3.5 units. Based on what we explained above, there is statistical significance to say that attitude is related to points.
The multiple R-squared of the model measures the proportion of variation in the target variable that can be explained by the explanatory variables. In this case, the variable attitude explains 19% of the variation in exam points.
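Using the fitted coefficients, we can sketch what the model predicts for new students (the `new_students` data frame below is a hypothetical illustration):

```r
# Hypothetical students with attitude scores of 2 and 4
new_students <- data.frame(attitude = c(2, 4))
# Predicted exam points, roughly 11.6 + 3.5 * attitude
predict(my_model2, newdata = new_students)
```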
The model assumes that the residuals (the differences between predicted and observed values) have a constant variance, i.e. that they do not depend on the explanatory variable. In the following plot we can see that the points are fairly randomly distributed and do not follow any pattern, so the assumption is reasonable.
plot(my_model2, which = 1)
The model also assumes that the residuals are normally distributed. With the following Q-Q plot we can check whether that assumption holds. In this case the points fall quite well on the line for almost all values, so the normality assumption is reasonable.
plot(my_model2, which = 2)
Finally, in this plot we can identify whether there are single observations with an unusually high impact on the model. If a single observation differs greatly from the rest, we say it has high leverage. In this case the leverage of the points is similar; no single point has much higher leverage than the others. Therefore we have regular leverage, which gives more validity to our model.
plot(my_model2, which = 5)
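The three diagnostic plots above can also be drawn side by side in a single call (a sketch using base graphics):

```r
# Arrange the three diagnostic plots in one row:
# residuals vs fitted (1), normal Q-Q (2), residuals vs leverage (5)
par(mfrow = c(1, 3))
plot(my_model2, which = c(1, 2, 5))
par(mfrow = c(1, 1))  # restore the default single-plot layout
```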
In this chapter we will analyse the relationships between the target variable, high/low alcohol consumption, and other variables. Because the target variable is binary (2 values, TRUE or FALSE), we will use logistic regression.
We obtained the original data from the UCI Machine Learning Repository. It is a combination of 2 data sets from the Student Performance Data Set: student performance in mathematics and in Portuguese. It contains 382 observations of 35 variables.
We have information about the students and their background, such as sex, age, family size, parents' education, parents' job, etc.
It also contains information about the students' grades; the most relevant is G3, the final grade.
Because in this exercise we are interested in studying high alcohol consumption among students, we created a variable called "alcohol use" (alc_use), which is the average of weekday and weekend alcohol consumption.
Finally, we created the variable that we will use as the target variable in this exercise, "high alcohol consumption" (high_use). This variable is TRUE when "alcohol use" is higher than 2 and FALSE otherwise.
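The wrangling itself was done in a separate script; a sketch of how alc_use and high_use can be derived, assuming the joined data with the original Dalc (workday) and Walc (weekend) consumption columns, both on a 1-5 scale:

```r
library(dplyr)

# Average of workday (Dalc) and weekend (Walc) alcohol consumption
alc <- mutate(alc, alc_use = (Dalc + Walc) / 2)
# TRUE when the average consumption is higher than 2
alc <- mutate(alc, high_use = alc_use > 2)
```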
For further information about the variables you can check: https://archive.ics.uci.edu/ml/datasets/Student+Performance
alc <- read.table("C:/Users/Elisabet/Desktop/ELI/MASTER'S DEGREE/Open Data Science/GitHub/IODS-project/IODS-project/alc.csv", header = TRUE, sep = ",")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
glimpse(alc)
## Observations: 382
## Variables: 35
## $ school <fctr> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP...
## $ sex <fctr> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F,...
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address <fctr> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,...
## $ famsize <fctr> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, ...
## $ Pstatus <fctr> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T,...
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob <fctr> at_home, at_home, at_home, health, other, services...
## $ Fjob <fctr> teacher, other, other, services, other, other, oth...
## $ reason <fctr> course, course, other, home, home, reputation, hom...
## $ nursery <fctr> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ internet <fctr> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes...
## $ guardian <fctr> mother, father, mother, mother, father, mother, mo...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup <fctr> yes, no, yes, no, no, no, no, yes, no, no, no, no,...
## $ famsup <fctr> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes...
## $ paid <fctr> no, no, yes, yes, yes, yes, no, no, yes, yes, yes,...
## $ activities <fctr> no, no, no, yes, no, yes, no, no, no, yes, no, yes...
## $ higher <fctr> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, ...
## $ romantic <fctr> no, no, no, yes, no, no, no, no, no, no, no, no, n...
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
## $ G1 <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11,...
## $ G2 <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11...
## $ G3 <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 1...
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1...
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL...
In order to study the relationships between high alcohol consumption and some other variables, we will choose 4 variables which may have some kind of relationship with the target variable:
* Failures: number of past class failures. Numeric, from 0 to 4. We assume a positive correlation with the target variable: the more alcohol consumption, the more failures.
* Absences: number of school absences. Numeric, from 0 to 93. We also assume a positive correlation: higher alcohol consumption, more absences.
* Final Grades (G3): final grade. Numeric, from 0 to 20. We assume a negative correlation: higher alcohol consumption, lower grades.
* Romantic: whether the student is in a romantic relationship. Binary, yes or no. We assume that students in a romantic relationship are more focused on the relationship and therefore do not go out to drink with friends as much.
First we will look at the distribution of our chosen variables with bar plots.
We can see that the target variable, high_use, is distributed so that fewer than half of the students actually have high alcohol consumption. Absences are quite sparsely distributed, with many students having very few absences but also many having a high number of them. The failures figure speaks for itself: the vast majority of students have not failed any class in the past. The grades are fairly spread out, but there are more students with lower grades. Finally, we can see that a bit less than half of the students are in a romantic relationship.
chosen_variables <- subset(alc, select = c("high_use", "failures", "absences", "G3", "romantic"))
library(tidyr)
library(ggplot2)
gather(chosen_variables) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped
With a bar plot of high_use faceted by romantic, we can see that, proportionally, students in a romantic relationship have a lower rate of high alcohol consumption.
g1 <- ggplot(alc, aes(x = high_use))
g1 + geom_bar() + facet_wrap("romantic")
Next we will draw some box plots with high_use as the grouping variable and grades, absences and failures as the plotted variables, coloured by sex or by romantic relationship.
For the first one, with high use vs grades, we also check the differences between males and females.
g1 <- ggplot(alc, aes(x= high_use, y = G3, col = sex))
g1 + geom_boxplot() + ylab("Grades") + ggtitle("Grades by alcohol consumption and sex")
In the second one we can observe that those in a relationship have more absences, but more importantly, it supports our hypothesis that students with high alcohol consumption have more absences.
g2 <- ggplot(alc, aes(x= high_use, y = absences, col = romantic))
g2 + geom_boxplot() + ylab("Absences") + ggtitle("Absences by alcohol consumption and relationship")
This last graph is quite surprising: the only visible relationship is between being in a relationship and failures. We do not see the hypothesized relationship between failures and high/low alcohol consumption, most probably because the failures variable is so heavily skewed towards 0 failures.
g3 <- ggplot(alc, aes(x= high_use, y = failures, col = romantic))
g3 + geom_boxplot() + ylab("Failures") + ggtitle("Failures by alcohol consumption and relationship")
We can also look at the differences numerically.
First, comparing the target variable with final grades, we see that students with high alcohol consumption have a lower mean final grade, which supports our hypothesis.
alc %>% group_by (high_use) %>% summarise (count = n(), mean_grade = mean (G3))
## # A tibble: 2 x 3
## high_use count mean_grade
## <lgl> <int> <dbl>
## 1 FALSE 268 11.73507
## 2 TRUE 114 10.80702
We can also count how many of the students in a relationship or not have high/low alcohol consumption, and see the mean grades for each combination. The results support our initial hypothesis and confirm numerically what we saw in the previous graph. Proportionally, students with a boyfriend/girlfriend have a lower rate of high alcohol consumption (27% = 33/(33+88)) than those without (31% = 81/(180+81)).
Again we can see that students with high alcohol consumption have lower grades. It is actually funny to see that students in a relationship also have lower grades, but that is not relevant to our study.
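The proportions quoted above can be verified with a quick cross-tabulation (a sketch using base R):

```r
# Cross-tabulate relationship status against high alcohol use
tab <- table(romantic = alc$romantic, high_use = alc$high_use)
# Row-wise proportions: share of high users within each relationship status
prop.table(tab, margin = 1)
```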
alc %>% group_by (romantic, high_use) %>% summarise (count = n(), mean_grade = mean (G3))
## # A tibble: 4 x 4
## # Groups: romantic [?]
## romantic high_use count mean_grade
## <fctr> <lgl> <int> <dbl>
## 1 no FALSE 180 12.00556
## 2 no TRUE 81 11.00000
## 3 yes FALSE 88 11.18182
## 4 yes TRUE 33 10.33333
In the next table we can see that students with high alcohol consumption have lower mean grades, more absences and more failures.
alc %>% group_by (high_use) %>% summarise (count = n(), mean_grade = mean (G3), mean_absences = mean(absences), mean_failures = mean(failures))
## # A tibble: 2 x 5
## high_use count mean_grade mean_absences mean_failures
## <lgl> <int> <dbl> <dbl> <dbl>
## 1 FALSE 268 11.73507 3.705224 0.1417910
## 2 TRUE 114 10.80702 6.368421 0.3421053
If we add the romantic variable to the previous table, we can observe numerically what we saw in the previous graphs. Students in a romantic relationship have slightly lower grades. Moreover, students with a girlfriend/boyfriend have more absences and failures when they have high alcohol use; with low alcohol use, they have more absences but fewer failures.
alc %>% group_by (high_use, romantic) %>% summarise (count = n(), mean_grade = mean (G3), mean_absences = mean(absences), mean_failures = mean(failures))
## # A tibble: 4 x 6
## # Groups: high_use [?]
## high_use romantic count mean_grade mean_absences mean_failures
## <lgl> <fctr> <int> <dbl> <dbl> <dbl>
## 1 FALSE no 180 12.00556 3.405556 0.1444444
## 2 FALSE yes 88 11.18182 4.318182 0.1363636
## 3 TRUE no 81 11.00000 5.777778 0.3086420
## 4 TRUE yes 33 10.33333 7.818182 0.4242424
After looking at the data and the possible relationships between variables, we will now fit a model with the chosen variables and see whether our hypotheses are statistically confirmed or rejected.
So our model would be:
\[ high\_use = romantic + grades + failures + absences \]
The results of the logistic regression clearly show that we did not choose the most appropriate variables, because what the previous exploration of the data suggested is not what we see after applying the regression model. In our model, the only statistically significant variables are absences and failures, leaving romantic and grades out. Choosing romantic as an explanatory variable may well have weakened the whole model. So, if required, it would be sensible to rerun the regression with different variables and check whether we obtain a better model to explain high/low alcohol consumption.
m <- glm(high_use ~ romantic + absences + failures + G3, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ romantic + absences + failures + G3,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2106 -0.7928 -0.7038 1.1546 1.9020
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.67924 0.49112 -1.383 0.166652
## romanticyes -0.34058 0.25850 -1.318 0.187658
## absences 0.08623 0.02282 3.779 0.000157 ***
## failures 0.39863 0.20005 1.993 0.046301 *
## G3 -0.05083 0.03818 -1.332 0.183020
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 436.93 on 377 degrees of freedom
## AIC: 446.93
##
## Number of Fisher Scoring iterations: 4
Next we will interpret the coefficients of the model as odds ratios and provide confidence intervals for them.
The odds ratio quantifies the relationship between X and Y. In this case Y is high alcohol use and X are the explanatory variables used in our model: romantic, absences, failures and grades.
For example, if we take romantic as X, the odds ratio measures the odds of having high alcohol consumption when the student is in a relationship, divided by the odds of having high alcohol consumption when the student is single.
\[ OR = Odds(Y | X) / Odds(Y | \textrm{without } X) \]
In logistic regression, the computational target variable is the log of the odds, so applying the exponential function to the fitted coefficients gives the odds ratios.
In the ORs we have computed, we can observe that students in a relationship are only 0.7 times as likely to be high alcohol consumers as those who are not. In other words, single students are 1.4 times more likely to be high alcohol consumers (to get the odds ratio for "romanticno", we simply take the inverse: 0.71^-1 = 1.41).
For non-binary variables we must interpret the odds ratio as the odds of being a high alcohol consumer after a one-unit change in variable X, divided by the odds with no change in variable X:
\[ exp(coef) = Odds(Y | X+1)/Odds(Y | X) \]
Therefore, from the odds ratio of absences, a student with one more absence is 1.09 times more likely to be a high alcohol consumer than a student without that extra absence. With failures the possible values are more limited (only 0 to 4), so the change is bigger: a student with one more failure is 1.5 times more likely to be a high alcohol consumer than a student without that extra failure.
Grades is the only variable with an OR and CI lower than 1, meaning that the correlation is negative: a student with a one-point-higher grade has about 5% lower odds of being a high alcohol consumer, e.g. a student with grade 4 is about 5% less likely to be a high alcohol consumer than one with grade 3.
The confidence intervals show the range in which the true odds ratio lies with 95% confidence. For the romantic variable the CI is very wide, which is one reason we did not find statistical significance. For both romantic and G3 the CI spans values below and above 1, so it is unclear whether the correlation with the target variable is positive or negative; that is another reason they cannot be significant.
Therefore we confirm our hypothesis that high alcohol consumption is positively related to absences and failures, but we reject our hypothesis that high alcohol consumption is related in any way to romantic relationships and grades.
# To compute OR (odds ratios) and CI (confidence intervals):
OR <- coef(m) %>% exp()
CI <- confint(m) %>% exp()
## Waiting for profiling to be done...
# Print them out:
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.5070040 0.1908492 1.316612
## romanticyes 0.7113585 0.4240480 1.171181
## absences 1.0900618 1.0446648 1.142606
## failures 1.4897812 1.0079375 2.221726
## G3 0.9504370 0.8816401 1.024407
Now, we will select only the significant variables (based on the previous logistic regression model) and create a new model to explore the predictive power of our model.
First we write the new model; then we predict the probability of high use based on the response of the model. We then add the probabilities and the predictions to our data set; a prediction is TRUE only when the probability is higher than 0.5. Finally we make a cross tabulation.
In the first table we can observe that among high-use consumers, the prediction is TRUE in only 15 cases and FALSE in 99, so in 99 cases the model predicted that a student was not a high alcohol consumer when in fact he or she was. We can also observe that in 10 cases the model predicted a student to be a high alcohol consumer when he or she was not. That already shows the weak predictive power of our model.
In the second table we see the same in proportions: only 3.9% of all cases are correctly predicted as high alcohol use, and 68% are correctly predicted as low alcohol use.
To sum up, our model makes the right prediction in 71.4% of the cases, which is reasonable but could be improved.
# We write the new model:
m2 <- glm(high_use ~ failures + absences, data = alc, family = "binomial")
summary(m2)
##
## Call:
## glm(formula = high_use ~ failures + absences, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2680 -0.8019 -0.6967 1.2522 1.7906
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.37829 0.16479 -8.364 < 2e-16 ***
## failures 0.49972 0.18456 2.708 0.006776 **
## absences 0.08602 0.02287 3.762 0.000169 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 440.14 on 379 degrees of freedom
## AIC: 446.14
##
## Number of Fisher Scoring iterations: 4
# We predict the probability of high_use, based on the response of our model:
probabilities <- predict(m2, type = "response")
# We add the probabilities to our data set:
alc <- mutate(alc, probability = probabilities)
# Now we use these probabilities to make a prediction of high use (the prediction will only be true if probability is higher than 0.5):
alc <- mutate(alc, prediction = probability > 0.5)
# Let's tabulate the predictions versus the actual values:
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 258 10
## TRUE 99 15
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.67539267 0.02617801 0.70157068
## TRUE 0.25916230 0.03926702 0.29842932
## Sum 0.93455497 0.06544503 1.00000000
We can also draw a plot to see the true values of high_use against the predictions:
# We can also see it graphically:
g <- ggplot(alc, aes(x=probability, y=high_use, col=prediction))
g + geom_point()
Finally, to measure the accuracy of our model, we will compute the total proportion of inaccurately classified individuals (= incorrect predictions in the training data). We define a loss function for this purpose and then compute the average number of wrong predictions in the training data. We obtain that about 0.29, or 29%, of the predictions are wrong.
The performance of the model is reasonable, but one can guess that a better model for predicting high/low alcohol consumption could be found. The simple guessing strategy gave us a fairly good understanding, but we were still wrong about 2 of the 4 variables, so it is questionable.
loss_func <- function(class, prob){
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class=alc$high_use, prob=alc$probability)
## [1] 0.2853403
As we have seen, statistical models can be used to make predictions. Cross-validation aims to estimate how accurately a predictive model will perform in practice.
In cross-validation, we calculate the same loss function as before, but on different data than was used for fitting the model (= testing data). In other words, we calculate the average number of wrong predictions in the testing data.
# 10-fold cross-validation:
library(boot)
cv <- cv.glm(data=alc, cost=loss_func, glmfit = m2, K=10)
# Average number of wrong prediction in the testing data with a 10-fold cross validation:
cv$delta[1]
## [1] 0.2879581
The average number of wrong predictions in cross-validation for my model is 0.29, which is higher than the error of the example model in DataCamp, 0.26. So my model has worse test set performance, as one could have expected at this point in the exercise.
We can fit different logistic regression models and compare their performance by looking at the training and testing errors. In the following models (model 3 to model 7) we start with 6 explanatory variables, and each subsequent model drops the least significant variable.
Looking at the errors, we cannot see any clear tendency as to whether having more or fewer variables improves the model performance.
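One way to organize this comparison is to loop over the candidate formulas and collect both errors in one place. A sketch, reusing the loss_func defined earlier; the first formula corresponds to model 3 below:

```r
library(boot)

# Candidate formulas, each dropping the least significant variable
formulas <- list(
  high_use ~ sex + failures + absences + G3 + romantic + goout,
  high_use ~ sex + failures + absences + romantic + goout,
  high_use ~ sex + failures + absences + goout
)

for (f in formulas) {
  fit <- glm(f, data = alc, family = "binomial")
  # Training error: wrong predictions on the data used for fitting
  train_err <- loss_func(class = alc$high_use, prob = predict(fit, type = "response"))
  # Testing error: 10-fold cross-validation estimate
  test_err <- cv.glm(data = alc, cost = loss_func, glmfit = fit, K = 10)$delta[1]
  cat(length(all.vars(f)) - 1, "predictors: training error",
      round(train_err, 3), "- testing error", round(test_err, 3), "\n")
}
```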
Model 3 (6 predictors):
# Model 3, 6 predictors:
m3 <- glm(high_use ~ sex + failures + absences + G3 + romantic + goout, data = alc, family = "binomial")
summary(m3)
##
## Call:
## glm(formula = high_use ~ sex + failures + absences + G3 + romantic +
## goout, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1319 -0.7948 -0.5292 0.8075 2.4357
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.66129 0.72329 -5.062 4.15e-07 ***
## sexM 0.90410 0.25705 3.517 0.000436 ***
## failures 0.29568 0.22493 1.315 0.188669
## absences 0.08370 0.02243 3.732 0.000190 ***
## G3 -0.02973 0.04204 -0.707 0.479487
## romanticyes -0.30928 0.27986 -1.105 0.269103
## goout 0.69962 0.12050 5.806 6.40e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 383.45 on 375 degrees of freedom
## AIC: 397.45
##
## Number of Fisher Scoring iterations: 4
# Adding probability and prediction columns to our data
probabilities3 <- predict(m3, type = "response")
alc <- mutate(alc, probability3 = probabilities3)
alc <- mutate(alc, prediction3 = probability3 > 0.5)
#Training error m3:
loss_func(class=alc$high_use, prob=alc$probability3)
## [1] 0.2015707
#Testing error m3:
cv <- cv.glm(data=alc, cost=loss_func, glmfit = m3, K=10)
cv$delta[1]
## [1] 0.2094241
Model 4 (5 predictors):
# Model 4, 5 predictors:
m4 <- glm(high_use ~ sex + failures + absences + romantic + goout, data = alc, family = "binomial")
summary(m4)
##
## Call:
## glm(formula = high_use ~ sex + failures + absences + romantic +
## goout, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0766 -0.7901 -0.5395 0.7706 2.4403
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.05277 0.47663 -8.503 < 2e-16 ***
## sexM 0.90227 0.25695 3.511 0.000446 ***
## failures 0.35231 0.21070 1.672 0.094508 .
## absences 0.08466 0.02244 3.773 0.000161 ***
## romanticyes -0.29147 0.27850 -1.047 0.295294
## goout 0.70951 0.11992 5.916 3.29e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 383.95 on 376 degrees of freedom
## AIC: 395.95
##
## Number of Fisher Scoring iterations: 4
# Adding probability and prediction columns to our data
probabilities4 <- predict(m4, type = "response")
alc <- mutate(alc, probability4 = probabilities4)
alc <- mutate(alc, prediction4 = probability4 > 0.5)
#Training error m4:
loss_func(class=alc$high_use, prob=alc$probability4)
## [1] 0.2172775
#Testing error m4:
cv <- cv.glm(data=alc, cost=loss_func, glmfit = m4, K=10)
cv$delta[1]
## [1] 0.2146597
Model 5 (4 predictors):
# Model 5, 4 predictors:
m5 <- glm(high_use ~ sex + failures + absences + goout, data = alc, family = "binomial")
summary(m5)
##
## Call:
## glm(formula = high_use ~ sex + failures + absences + goout, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0299 -0.7890 -0.5417 0.7857 2.3588
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.13390 0.47292 -8.741 < 2e-16 ***
## sexM 0.91999 0.25625 3.590 0.000330 ***
## failures 0.34560 0.21093 1.638 0.101330
## absences 0.08243 0.02235 3.688 0.000226 ***
## goout 0.70797 0.11998 5.901 3.62e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 385.06 on 377 degrees of freedom
## AIC: 395.06
##
## Number of Fisher Scoring iterations: 4
# Adding probability and prediction columns to our data
probabilities5 <- predict(m5, type = "response")
alc <- mutate(alc, probability5 = probabilities5)
alc <- mutate(alc, prediction5 = probability5 > 0.5)
#Training error m5:
loss_func(class=alc$high_use, prob=alc$probability5)
## [1] 0.2146597
#Testing error m5:
cv <- cv.glm(data=alc, cost=loss_func, glmfit = m5, K=10)
cv$delta[1]
## [1] 0.2146597
Model 6 (3 predictors):
# Model 6, 3 predictors:
m6 <- glm(high_use ~ sex + absences + goout, data = alc, family = "binomial")
summary(m6)
##
## Call:
## glm(formula = high_use ~ sex + absences + goout, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7871 -0.8153 -0.5446 0.8357 2.4740
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.16317 0.47506 -8.764 < 2e-16 ***
## sexM 0.95872 0.25459 3.766 0.000166 ***
## absences 0.08418 0.02237 3.764 0.000167 ***
## goout 0.72981 0.11970 6.097 1.08e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 387.75 on 378 degrees of freedom
## AIC: 395.75
##
## Number of Fisher Scoring iterations: 4
# Adding probability and prediction columns to our data
probabilities6 <- predict(m6, type = "response")
alc <- mutate(alc, probability6 = probabilities6)
alc <- mutate(alc, prediction6 = probability6 > 0.5)
#Training error m6:
loss_func(class=alc$high_use, prob=alc$probability6)
## [1] 0.2094241
#Testing error m6:
cv <- cv.glm(data=alc, cost=loss_func, glmfit = m6, K=10)
cv$delta[1]
## [1] 0.2172775
Model 7 (2 predictors):
# Model 7, 2 predictors:
m7 <- glm(high_use ~ sex + goout, data = alc, family = "binomial")
summary(m7)
##
## Call:
## glm(formula = high_use ~ sex + goout, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5599 -0.8726 -0.6260 0.8382 2.4893
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.8130 0.4529 -8.420 < 2e-16 ***
## sexM 0.8736 0.2466 3.543 0.000396 ***
## goout 0.7609 0.1181 6.441 1.19e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 402.74 on 379 degrees of freedom
## AIC: 408.74
##
## Number of Fisher Scoring iterations: 4
# Adding probability and prediction columns to our data
probabilities7 <- predict(m7, type = "response")
alc <- mutate(alc, probability7 = probabilities7)
alc <- mutate(alc, prediction7 = probability7 > 0.5)
#Training error m7:
loss_func(class=alc$high_use, prob=alc$probability7)
## [1] 0.2094241
#Testing error m7:
cv <- cv.glm(data=alc, cost=loss_func, glmfit = m7, K=10)
cv$delta[1]
## [1] 0.2277487
In this chapter we will learn how to classify data using linear discriminant analysis (LDA) and k-means, one of the best-known clustering methods.
We will use the dataset called Boston from the MASS package. The data has 506 observations of 14 variables and describes housing values in the suburbs of Boston, together with variables such as per capita crime rate by town, proportion of non-retail business acres per town, nitrogen oxides concentration, etc. More information about the data can be found here: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
Next we can see a graphical overview of the data and the summaries of the variables.
Because we have so many variables, it would be tedious to draw a scatter plot for every pair of variables; an easier way to see the relationships between variables is a correlation matrix.
In our matrix we can see visually which variables are most related and whether the relationship is positive or negative (from the colours and sizes of the circles), and we can also read the correlations numerically.
The highest correlation is between rad and tax, i.e. the index of accessibility to radial highways and the full-value property-tax rate, so the tax rate is strongly associated with how accessible the property is by radial highways. We can see in the graphic that many other variable pairs are also well correlated.
install.packages("corrplot")
install.packages("tidyverse")
library(corrplot)
## corrplot 0.84 loaded
library(tidyverse)
## -- Attaching packages ---------------------------------------------------------------------- tidyverse 1.2.1 --
## v tibble 1.3.4 v purrr 0.2.4
## v readr 1.1.1 v stringr 1.2.0
## v tibble 1.3.4 v forcats 0.2.0
## -- Conflicts ------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x MASS::select() masks dplyr::select()
cor_matrix <- cor(Boston) %>% round(digits=2)
corrplot.mixed(cor_matrix, lower.col = "black", number.cex = .6)
We can see a summary of the variables in the following table. For example, for the variable crim we can see that the mean per capita crime rate is 3.6 and the maximum 88.9, which looks quite alarming. The mean nitrogen oxides concentration is 0.55 (parts per 10 million).
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Finally in the following graphs we can visually see how the variables are distributed.
gather(Boston) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
Linear discriminant analysis assumes that the variables are normally distributed and that they share the same variance, which is why it is better to scale the data before fitting our model.
Below are the summaries of the scaled data. Note how the values have changed compared to the previous summary: now each variable has mean 0 (and standard deviation 1), so all the variables are on the same scale.
After scaling, we convert the object to a data frame so we can use it as data later.
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
boston_scaled <- as.data.frame(boston_scaled)
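The scale() function standardizes each column by subtracting its mean and dividing by its standard deviation, i.e. (x - mean(x)) / sd(x). A quick sanity check on one column (my own illustration, not part of the original analysis):

```r
# scale() standardizes column-wise: (x - mean(x)) / sd(x)
manual_crim <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(boston_scaled$crim, manual_crim)  # should be TRUE
```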
First we will create a categorical variable of the crime rate in the Boston dataset. For that we will use the quantiles as the break points in the categorical variable.
# First we create the quantile vector of crim:
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# Then we create the categorical variable "crime":
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
table(crime)
## crime
## low med_low med_high high
## 126 126 126 127
Then we will remove the old variable “crim” and add the new one “crime” to the dataset.
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
Next, we divide the dataset into train and test sets, so that 80% of the data belongs to the train set. This will allow us to make predictions later in this chapter.
n <- nrow(boston_scaled)
# Choose randomly 80% of the rows:
ind <- sample(n, size = n*0.8)
# create the train set:
train <- boston_scaled[ind,]
# create the test set:
test <- boston_scaled[-ind,]
Now we fit the linear discriminant analysis (LDA) on the train dataset, with crime as the target variable and all the other variables as predictors.
lda.fit <- lda(crime ~., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2729529 0.2282878 0.2506203 0.2481390
##
## Group means:
## zn indus chas nox rm
## low 0.98812310 -0.8879687 -0.09336997 -0.8871939 0.4714092
## med_low -0.07175141 -0.3294507 0.07002747 -0.5986490 -0.1560526
## med_high -0.38450487 0.2017264 0.19544522 0.3671334 0.1687034
## high -0.48724019 1.0171519 -0.03610305 1.0666847 -0.3933784
## age dis rad tax ptratio
## low -0.9050065 0.8643467 -0.6999747 -0.7373119 -0.44472543
## med_low -0.4156825 0.4395242 -0.5511961 -0.5508926 -0.06983352
## med_high 0.3885408 -0.3872968 -0.4065006 -0.3061260 -0.30233717
## high 0.8369823 -0.8810285 1.6377820 1.5138081 0.78037363
## black lstat medv
## low 0.3759752 -0.77353786 0.58268470
## med_low 0.3184127 -0.13668930 -0.01715826
## med_high 0.1045196 -0.02438746 0.24209790
## high -0.8198416 0.89642650 -0.68465098
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.12059047 0.657963145 -0.925492771
## indus -0.03895302 -0.219939386 0.378467785
## chas -0.07254555 -0.019715704 0.178843594
## nox 0.47603352 -0.843484307 -1.150640685
## rm -0.10041649 -0.116543736 -0.232167256
## age 0.25198706 -0.310347467 -0.129805347
## dis -0.09028332 -0.284022805 0.362710972
## rad 3.06696536 1.042658747 0.326030339
## tax 0.08177250 -0.067288339 0.006671975
## ptratio 0.15196179 -0.004030454 -0.268923700
## black -0.15755101 0.025159085 0.118944987
## lstat 0.24795396 -0.181397937 0.337433463
## medv 0.20232902 -0.370250495 -0.250966583
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9502 0.0368 0.0130
Finally we draw the LDA biplot to visualize how the data is distributed. Each class has a different colour, and the arrows represent the predictor variables; their length and direction are based on the coefficients and show their impact on the model.
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col=classes, pch= classes)
lda.arrows(lda.fit, myscale = 1)
First we will save the crime categories from the test set and then remove the crime variable from the test dataset.
# we save the correct classes from test data:
correct_classes <- test$crime
# remove the crime variable from the test data:
test <- dplyr::select(test, -crime)
Based on the training data, the LDA model we created calculates the probability of each new observation belonging to each of the classes and classifies it into the class with the highest probability.
We predict the classes with LDA model on the test data.
lda.pred <- predict(lda.fit, newdata = test)
Finally we cross-tabulate the results of our predictions against the crime categories from the test set that we saved at the beginning of this section. We can observe that most of the predictions are correct: 35 of the 102 predictions are incorrect. Looking at the proportions, about 65.7% of the predictions (the sum of the diagonal) are correct.
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 8 8 0 0
## med_low 5 15 14 0
## med_high 0 7 17 1
## high 0 0 0 27
table(correct = correct_classes, predicted = lda.pred$class) %>% prop.table() %>% addmargins()
## predicted
## correct low med_low med_high high Sum
## low 0.078431373 0.078431373 0.000000000 0.000000000 0.156862745
## med_low 0.049019608 0.147058824 0.137254902 0.000000000 0.333333333
## med_high 0.000000000 0.068627451 0.166666667 0.009803922 0.245098039
## high 0.000000000 0.000000000 0.000000000 0.264705882 0.264705882
## Sum 0.127450980 0.294117647 0.303921569 0.274509804 1.000000000
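The overall accuracy can also be computed directly as the sum of the diagonal of the cross tabulation divided by the total (an extra check of my own, not part of the original code):

```r
# Proportion of correct predictions = diagonal / total
conf <- table(correct = correct_classes, predicted = lda.pred$class)
sum(diag(conf)) / sum(conf)  # here about 0.66
```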
First we scale the dataset Boston.
boston_scaled2 <- scale(Boston)
summary(boston_scaled2)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
# change the object to data frame
boston_scaled2 <- as.data.frame(boston_scaled2)
We calculate the distances between the observations, using the most common measure, the Euclidean distance.
dist_eu <- dist(boston_scaled2)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
Then we run the k-means algorithm on the dataset. First we pick an arbitrary number of clusters (initial cluster centers).
km <- kmeans(boston_scaled2, centers = 3)
Now we will investigate the optimal number of clusters. First we set the maximum number of clusters to 10. Then we use one of the most common methods for choosing the number of clusters: the total within-cluster sum of squares (TWCSS), i.e. the sum of the within-cluster sums of squares (WCSS) over all clusters. When we plot the number of clusters against the TWCSS, the optimal number of clusters is where the TWCSS drops radically.
# avoid the kmeans to give us every time different results
set.seed(123)
# set the maximum number of clusters to 10
k_max <- 10
# calculate the total sum of squares:
twcss <- sapply(1:k_max, function(k){kmeans (boston_scaled2, k)$tot.withinss})
# see graphically the optimal number of clusters:
qplot(x = 1:k_max, y = twcss, geom = 'line')
From the above graph we can see that the optimal number of clusters is 2. So we run the k-means algorithm again and visualize the results. First we see a matrix with all the variables, then zoomed in on variables 1 to 5 and 6 to 10. In these plots we can observe the two clusters in the data, in red and black, and how they relate for each pair of variables.
km <- kmeans(boston_scaled2, centers = 2)
pairs(boston_scaled2, col=km$cluster)
pairs(boston_scaled2[1:5], col=km$cluster)
pairs(boston_scaled2[6:10], col=km$cluster)
First we look at the summary of the original Boston dataset again and then scale it.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
boston_scaled3 <- scale(Boston)
summary(boston_scaled3)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
Then we perform k-means.
km2 <- kmeans(boston_scaled3, centers = 3)
boston_scaled3 <- as.data.frame(boston_scaled3)
Then we perform LDA using the clusters as target classes.
lda.fit2 <- lda(km2$cluster ~., data=boston_scaled3)
Then we visualize the results. We can observe that the most influential variables separating the clusters are accessibility to radial highways (rad), nitrogen oxides concentration (nox) and proportion of residential land zoned for lots over 25,000 sq.ft. (zn).
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(km2$cluster)
# plot the lda results
plot(lda.fit2, dimen = 2, col=classes, pch= classes)
lda.arrows(lda.fit2, myscale = 1)
First we run the code below on the scaled train data that we used to fit the LDA. The code computes a matrix product, which is a projection of the data points.
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
Then we draw a plot.
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
Then we draw the same plot, but assigning the colors to the crime classes of the train set.
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
Finally, we draw the same plot but now setting the colors to be the clusters of the k-means.
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km$cluster)
In the second graphic we can identify which observations belong to which crime classes, and we can observe some clustering within classes; especially for the class high there is a clear cluster, while the rest are not as clear.
In the last graphic each colour should correspond to a k-means cluster, and as we could have guessed, the optimal number is only 2 clusters: one for high and another one for the rest of the crime classes. For some reason we could not proceed to draw this graphic.
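One possible fix (an assumption on my part, since the original chunk failed): the k-means above was run on all 506 scaled observations, while the projection only contains the 404 training rows, so the colour vector does not match the number of points. Running k-means on the training rows only makes the lengths agree:

```r
# k-means on the train rows only, so the clusters align with matrix_product
set.seed(123)
km_train <- kmeans(dplyr::select(train, -crime), centers = 2)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3,
        type = 'scatter3d', mode = 'markers',
        color = factor(km_train$cluster))
```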
In many cases, datasets contain a huge number of variables which are difficult to interpret. With dimensionality reduction techniques we can reduce the number of variables so the data is easier to interpret. In this chapter we will use two of the most used techniques, PCA (Principal Component Analysis) and MCA (Multiple Correspondence Analysis).
The data used for this exercise originates from the United Nations Development Programme. For more information about the data you can visit: http://hdr.undp.org/en/content/human-development-index-hdi
First we load the data and explore its structure and dimensions. We can see the data has 155 observations of 8 variables.
We have the following variables in our dataset:
human <- read.table(file="C:/Users/Elisabet/Desktop/ELI/MASTER'S DEGREE/Open Data Science/GitHub/IODS-project/IODS-project/Data/human.csv", header = TRUE, row.names = 1)
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ ratio.edu2 : num 1.007 0.997 0.983 0.989 0.969 ...
## $ ratio.labforce: num 0.891 0.819 0.825 0.884 0.829 ...
## $ yr.eduexp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ lifexp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ matermor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ adobirth : num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ repreparl : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
dim(human)
## [1] 155 8
First we can see in the summaries that both the ratio of females to males with at least secondary education and the ratio of females to males in the labour force are under 1, meaning that there are more men with at least secondary education and more men in the labour force. The average expected years of schooling is 13 years, and the average life expectancy at birth is 71.7 years. Mean GNI per capita is 17,628 USD.
The maternal mortality ratio has a minimum of 1 and a maximum of 1,100 deaths per 100,000 births; the mean is 149. The rationale is that countries where the maternal mortality ratio exceeds 1,000 are not able to make a difference by supporting maternal health.
The adolescent birth rate, i.e. adolescent pregnancies, is on average 47 births per 1,000 women aged 15-19.
Finally, the percentage of female representatives in parliament is on average 21%, with a maximum of 57.5%.
summary(human)
## ratio.edu2 ratio.labforce yr.eduexp lifexp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI matermor adobirth repreparl
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
A good way of visualizing the distributions of the variables is the ggpairs function.
We can observe, for example, that the large majority of countries have low maternal mortality, which is definitely good news. On the other hand, most countries are at the lower end of GNI. Expected years of schooling follows roughly a normal distribution, slightly skewed towards more years of schooling, which seems very positive. Life expectancy at birth is skewed towards more years, but there is still a large number of countries below 60 years.
library(GGally)
ggpairs(human)
We can also explore the relationships between variables using a correlation matrix. In the previous correlation numbers and in the following matrix we can see that, unsurprisingly, life expectancy at birth and maternal mortality are strongly and negatively correlated. Maternal mortality is also strongly and negatively correlated with the secondary education ratio and the expected years of education, which strongly supports education being an essential tool for avoiding deaths at birth. It is also good to see that life expectancy is positively correlated with expected years of education.
library(corrplot)
humancor <- cor(human)
corrplot(humancor)
PCA transforms the data into fewer dimensions (new features) called principal components (PCs). The first PC captures the maximum amount of variance from the features in the original data. The second PC is uncorrelated with the first and captures the maximum amount of the remaining variability, and so on: all the PCs are uncorrelated with each other and each captures less variability than the previous one.
So now we will apply PCA to the unstandardized human data.
pca_human <- prcomp(human)
We look at the variability captured by the principal components. It is pretty clear that nearly all of the variability (99.99%) is captured by the first principal component alone.
s <- summary(pca_human)
s
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000
## PC8
## Standard deviation 0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion 1.0000
pca_pr <- round(100*s$importance[2,], digits=1)
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 100 0 0 0 0 0 0 0
Then we draw a biplot displaying the observations along the first two principal components. We can observe in the plot below how the GNI variable dominates the first principal component.
biplot(pca_human, choices = 1:2, cex = c(0.6, 1))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
Now we will repeat the analysis, but this time we will standardize the data first.
Unlike LDA, PCA has no target variable. PCA is sensitive to the original scaling of the features (= variables): it assumes that features with higher variance are more important than those with lower variance. That is why standardization is advisable before performing the analysis. Let's see if we get clearer results after scaling the data.
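As a quick sketch of what standardization does (with illustrative values, not from the dataset): scale() centers each column to mean 0 and divides it by its standard deviation.

```r
# scale() is equivalent to (x - mean(x)) / sd(x) applied column by column
x <- c(2, 4, 6, 8)
scaled_x <- as.vector(scale(x))
manual_x <- (x - mean(x)) / sd(x)
all.equal(scaled_x, manual_x)  # TRUE
```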
With the scaled data, the first principal component captures 53.6% of the variability, the second 16.2%, the third 9.6%, and so on. This seems much more reasonable: the variability is distributed among the PCs instead of being assigned almost entirely to the first one, as happened in the previous PCA.
human <- scale(human)
pca_human <- prcomp(human)
s <- summary(pca_human)
s
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
## PC7 PC8
## Standard deviation 0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion 0.98702 1.00000
pca_pr <- round(100*s$importance[2,], digits=1)
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
Let's draw the biplot with the scaled data; now the contribution of each of the variables is much more visible. It also seems logical that all the variables contribute to the analysis, not just one.
In a biplot, the observations are shown as a scatter plot with x and y coordinates defined by the first two PCs (in this case). It also shows the original variables and their relationships both with each other and with the principal components. The angle between two feature arrows reflects their correlation: the smaller the angle, the higher the positive correlation. The angle between a feature arrow and a PC axis reflects the correlation between the two. The length of the arrows is proportional to the standard deviation of the features.
We can observe that representation in the parliament is positively correlated with the labour force ratio; that expected years of education, the secondary education ratio and life expectancy are positively correlated with each other; and that adolescent births and maternal mortality are positively correlated. On the other hand, adolescent births and maternal mortality are negatively correlated with expected years of education, the secondary education ratio, GNI and life expectancy.
On the one hand, maternal mortality, adolescent births, expected years of education, the secondary education ratio and life expectancy contribute to the first principal component. On the other hand, the labour force ratio and the percentage of female representatives in the parliament contribute to the second principal component.
Almost all variables contribute equally to PC1 and PC2; only expected years of education and life expectancy have a slightly higher contribution.
biplot(pca_human, choices = 1:2, cex = c(0.6, 0.9))
We can also draw a biplot showing the variability captured by the first two components in the axis labels, and change the colors.
pca_lab <- paste0(names(pca_pr), "(", pca_pr, "%)")
biplot(pca_human, cex = c(0.6, 1), col=c("grey40", "deeppink2"), xlab= pca_lab[1], ylab = pca_lab[2])
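As a small aside, paste0 builds those axis labels by concatenating its arguments element-wise, recycling the shorter vectors (the values below are illustrative):

```r
# paste0 concatenates element-wise with no separator, recycling as needed
pct <- c(53.6, 16.2)
labels <- paste0(c("PC1", "PC2"), " (", pct, "%)")
labels  # "PC1 (53.6%)" "PC2 (16.2%)"
```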
MCA analyses the pattern of relationships between several categorical variables. Continuous variables can also be used as background (supplementary) variables. MCA can be used with qualitative data and works on frequencies. It makes few assumptions about the variables in the data.
We will use the tea dataset to demonstrate MCA. Let's first explore the dataset briefly and visualize the distributions of the variables.
The tea dataset has 300 observations of 36 variables. It contains information about tea-drinking habits: for example, whether the people who responded to the questionnaire drink tea at breakfast, lunch or dinner, whether they drink it with friends, which type of tea they drink, whether or not they add sugar, etc.
library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 3.4.3
data("tea")
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
To make the analysis easier, we will select a few variables of the dataset and look at their distributions. Most of the interviewees do not have tea at lunch, most of them use tea bags, and most drink their tea plain, without adding lemon or milk. They mainly drink Earl Grey, and it is almost fifty-fifty whether they add sugar or not. Most of them buy their tea in chain stores.
library(dplyr)
library(tidyr)
library(ggplot2)
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- select(tea, one_of(keep_columns))
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
Now we will apply MCA to the tea data. The eigenvalues show the variance and the percentage of variance captured by each dimension; the first dimension captures about 15% of the variance in the data.
Then we can see the individuals table: the individual coordinates, the contribution (in percent) to each dimension and the squared correlations (= cos2) with the dimensions. For example, individual 1 contributes 0.11 to dimension 1, 0.14 to dimension 2 and 0.16 to dimension 3.
The categories table shows the coordinates of the variable categories, the contribution (%), the cos2 and the v.test value. The v.test follows a normal distribution: if it is above +1.96 or below -1.96, the coordinate is significantly different from 0, which is the case for most of the categories here.
The categorical variables table shows the squared correlation (eta2) between each variable and the dimensions: a value close to 1 indicates a strong link between the variable and the dimension. For example, "how" and "where" both have values around 0.7 on dimension 1, so they are quite strongly related to it.
mca <- MCA(tea_time, graph = FALSE)
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.279 0.261 0.219 0.189 0.177 0.156
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.144 0.141 0.117 0.087 0.062
## % of var. 7.841 7.705 6.392 4.724 3.385
## Cumulative % of var. 77.794 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898
## cos2 v.test Dim.3 ctr cos2 v.test
## black 0.003 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 0.027 2.867 | 0.433 9.160 0.338 10.053 |
## green 0.107 -5.669 | -0.108 0.098 0.001 -0.659 |
## alone 0.127 -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 0.035 3.226 | 1.329 14.771 0.218 8.081 |
## milk 0.020 2.422 | 0.013 0.003 0.000 0.116 |
## other 0.102 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag 0.161 -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 0.478 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged 0.141 -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
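As a side note, the ±1.96 threshold used for the v.test above is simply the two-sided 5% critical value of the standard normal distribution, which we can confirm directly:

```r
# two-sided 5% critical value of the standard normal distribution
threshold <- qnorm(0.975)
round(threshold, 2)  # 1.96
```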
Finally, we will draw an MCA factor map where the variable categories are drawn on the first two dimensions, so we can visualize possible patterns in the variables. The distance between variable categories gives a measure of their similarity: very close means very similar.
plot(mca, invisible = "ind", habillage = "quali")
plot(mca, invisible = "var", habillage = "quali")